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This paper investigates some problems associated with an expression evaluation order that we 
call "future" order, which is different from call-by-name, call-by-value, and call-by-need. In 
future order evaluation, an object called a "future" is created to serve as the value of each 
expression that is to be evaluated and separate process is dedicated to its evaluation. This 
mechanism allows the fully parallel evaluation of the expressions in a programming language. 

We discuss an approach to a problem that arises in this context: futures which were thought to 
be relevant when they were created become irrelevant through not being needed later in the 
computation. The problem of irrelevant processes also appears in multiprocessing 
problem-solving systems which start several processors working on the same problem but with 
different methods, and return with the solution which finishes first. This parallel method 
strategy has the drawback that the processes which are investigating the losing methods must 
be identified, cleanly stopped, and the processors they are using re-assigned to more useful 
tasks. 

The solution we propose is that of incremental garbage collection. The goal structure of the 
solution plan should be explicitly represented in memory as part of the graph memory (like 
Lisp's heap) so that a garbage collection algorithm can discover which processes are performing 
useful work, and which can be recycled for a new task. 

An incremental algorithm for the unified garbage collection of storage and processes is 
described. 
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1. Introduction 

Processors are becoming very cheap and there is good evidence that this trend will continue in the 
next few years. As a result, there has been considerable interest in how to apply large numbers of 
processors to the solution of a single task [8]. Since efficient utilization of a horde of processors 
requires a lot of communication, sorting networks have been devised [2,21] which allow every 
processor in an N-processor system to both send and receive a message on every clock pulse. 
Furthermore, the transit time through the network is on the average only O(log N) and the size of 
the network only 0(N log 2 N). However, it is still not clear how to effectively utilize all of these 
processors. 

Other researchers [11,5,22] quite rightly note that languages without side-effects, e.g. "pure" LISP, 
are excellently suited for the purpose of representing many algorithms intended for execution on a 
host of processors since their lack of side-effects eliminates a great source of complexity in parallel 
execution. Thus, "Church-Rosser" theorems can be proved which ensure the invariance of the 
value of an expression regardless of the order or relative speed of evaluation. However, we must 
keep in mind that this kind of parallelism does not implement the most general form of 
communication between processes. For example, an airline reservation system cannot be 
implemented in such a language, due to its non-determinate behavior. 

In this paper, we consider an "eager beaver" evaluator for a language without side-effects, such as 
pure LISP. When an expression of the language is given to the evaluator by the user, the 
evaluator evaluates it and all of its subexpressions as soon as possible, and in parallel. The 
evaluator does this by creating and returning for each subexpression a future, which is a promise 
to deliver the value of that subexpression at some later time, if it has a value. Each future can 
evaluate its subexpression independently and concurrently with other futures because it is created 
with its own evaluator process, which is dedicated to evaluating its subexpression. When the value 
of a future is needed explicitly, e.g. by the primitive function V\ the evaluator of the 
subexpression may or may not be complete. If it is complete, the future's value is immediately 
available; if not, the requesting process is forced to wait until the evaluation of the subexpression is 
done. 

Futures are created recursively in the evaluation of an expression by our eager evaluator whenever 
it encounters functional application. A new future is created for each argument, resulting in the 
parallel (collateral) evaluation of those arguments, while the main process tackles the job of 
evaluating the function position and applying it to the tuple of argument futures. We call the 
main evaluator process the parent, while the futures it directly creates are its offspring. 

More precisely, a future is a triple (process, cell, waiting room), where process is the virtual 
processor initialized to evaluate an argument expression in its proper environment, cell is a writable 
location in memory which will save the value of the argument when it is ready, to avoid 
recomputing it, and waiting room is a set of processes which are waiting for the value of this 
future. 

When the future is created, its process starts evaluating the subexpression in the given 
environment. If any other process needs the value of this future before it is ready, the requesting 
process enters the waiting room of the future and goes to sleep. When the value promised by the 
future is ready, its process stores that value into the future's cell, wakes up all of the processes in 
the future's waiting room, and dies. Henceforth, any process needing this future's value can find it 
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in the future's cell, without waiting or performing any further computation. 

Notice that a future of an expression is different from the delayed value [26,23,12,15,10] of the 
expression in that the latter are designed to delay evaluation of the expression until the value is 
needed while a future immediately dedicates a processor to evaluating the expression. This 
difference is both a strength and a weakness of eager evaluation. 

The main problem with our eager interpreter is that it can be wasteful, because it anticipates which 
values are going to be required to compute the final result. For example, a process may be assigned 
to the computation of a future whose value will never be needed; in this case, we say that the 
process is irrelevant. If there were no way to determine irrelevancy, these irrelevant processes could 
tie up a significant amount of computing power. Furthermore, if a process was assigned to evaluate 
a non-terminating expression, its computational power would be lost to the system forever! In the 
following sections, we argue that the "garbage collection" of passive storage can be extended to the 
reclamation of these irrelevant active processes. Furthermore, we show that this garbage collection 
can be done incrementally, thus eliminating the long delays classically associated with garbage 
collection. 

2. Garbage Collecting Irrelevant Futures 

A classical garbage collector for passive storage starts by marking the root of the heap of passive 
storage nodes, and proceeds by propagating marks from marked nodes to their offspring, until 
there is no unmarked node with a marked parent. Upon the completion of this process, any nodes 
which are still unmarked are not accessible from the root; hence they are declared garbage and 
returned to the list of available free nodes. 

The key to garbage collecting active processes is that active process's process-states are addressable 
as vectors of words in the common address space of all the processors, but marked with a special 
type code. These vectors store the contents of the registers of the processes, plus additional status 
information. We claim that processes whose process-states become inaccessible from the root are 
irrelevant and should be reclaimed. The top-level process-that assigned to the top level future-is 
always relevant since the user expects an answer, and therefore it is always directly accessible from 
the root of the heap. Any offspring of this future whose values are still required are accessible to 
it. Hence by induction, relevant processes remain accessible from the root. If a future becomes 
inaccessible from the root, then no other process can access its value-even when it is ready— and 
hence the future and its process are irrelevant. 

In order that all irrelevant processes be identified as soon as possible, we must make sure that all 
processes classified as accessible are truly relevant to the computation. An example of a process 
which is accessible but irrelevant is that of a loiterer, i.e. a process which is accessible only through 
the "waiting room" of some future. A loiterer is waiting for the value of one or more futures, but 
whose value is not needed by any other process. Loiterers cannot be immediately garbage collected 
because of the outstanding waiting-room pointers to them, but they need not be restarted. If a 
loiterer is encountered in a second garbage collection, it can at that time be reclaimed. Hence, 
waiting-room accessibility is a second-class form of accessibility which wiil not protect a loiterer 
from eventually being garbage collected. 
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If busy waiting is used, waiting rooms are not necessary, and thus their pointers do not requir 
special treatment by the garbage collector. However, busy waiting requires that a high price be 
paid for communication channels between the waitor and the waitee, because the incessant queries 
clog these channels. 

Garbage collection is made incremental by using some of the ideas from an earlier paper [1], which 
in turn is based on the work of Dijkstra [6,7] and Lamport [17,18]. The mark phase of our 
incremental garbage collector process employs three colors for every object—white, grey, and black. 
Intuitively, luhite nodes are not yet known to be accessible, grey nodes are known to be accessible, 
but whose offspring have not yet been checked, and black nodes are accessible, and have accessible 
offspring. Initially, all nodes (including processes) are white. A white node is made grey by 
shading it; i.e. making it "at least grey" [6], while a grey node is marked by shading its offspring 
and making the node black-both indivisible processes. Marking is initiated by stopping all 
processes and shading the root. Marking proceeds by finding a grey node, shading its offspring, 
then making that node black. When there are no more grey nodes, garbage collection is done; all 
still-white nodes are then emancipated and the colors white and black switch interpretations. 

Although all user processes must be stopped when garbage collection is begun, a user process can 
be restarted as soon as it has been blackened by the collector. Since the top-level process is pointed 
at directly by the root of the heap, it is restarted almost immediately. It should be obvious that 
when a process first becomes black, it cannot point directly at a white node. We wish to preserve 
this assertion. Therefore, whenever a running black process is about to violate it— by inserting into 
one of its registers the white component of a node it is already pointing at— it immediately shades 
the white node before proceeding. Furthermore, every new node the process needs is created black. 
The intuitive rationale behind these policies is that so far as any black process is concerntd, the 
garbage collection has already finished. Furthermore, the nodes which are found accessible by the 
garbage collector are exactly those which were accessible at the time the garbage collection was started. 

We prove the correctness of this garbage collector informally. The garbage collector is given a 
head start on all of the processes because they are stopped when it is started. When a process is 
restarted, it is black, and everything it sees is at least grey, hence it is in the collector's wake. 
Whenever a process attempts to catch up to the collector by tracing an edge from a node it can 
access directly, that node is immediately shaded. Therefore, it can never pass or even catch the 
collector. Since the collector has already traced any node a process can get its registers on, the 
process cannot affect the connectivity of the nodes that the collector sees. Because white or grey 
processes are not allowed to run, any created nodes are black, and since nodes darken 
monotonically, the number of white nodes must monotonically decrease, proving termination. 

Our garbage collector has only one phase-the mark phase-because it uses a compacting, copying 
algorithm [9,4] which marks and copies in one operation. This algorithm copies accessible list 
structures from an "old semispace" into a "new semispace". As each node is copied, a "forwarding 
address" is left at its old address in the old semispace. If the Minsky copying algorithm is used [9], 
the collector has its own stack to keep track of grey nodes; the Cheney algorithm [4] uses a "scan 
pointer" to linearly scan the new semispace, while updating the pointers of newly moved nodes by 
moving the nodes they point to. The correspondence between our coloring scheme and these 
algorithms is this: white nodes are those which reside in the old semispace; grey nodes are those 
which have been copied to the new semispace, but whose outgoing pointers have not been updated 
to point into the new semispace (i.e. have not yet been encountered by the scan pointer in the 
Cheney algorithm); and black nodes are those which have been both moved and updated (i.e. are 
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behind the scan pointer). When scanning is done (i.e. there are no more grey nodes and all 
accessible nodes have been copied), the old and new semispaces then interchange roles. 
Reallocating processors is simple; all processors are withdrawn at the start of garbage collection, and 
are allocated to each process as it is blackened. Thus, when the garbage collection has finished, all 
and only relevant processes have been restarted. 

The restriction that white or grey processes cannot run can be relaxed to allow white processes to 
run so long as a white process does not cause a black node to point to a white one. This can only 
happen if the white process is trying to perform a side-effect (RPLACA or RPLACD in LISP) on 
a black node. If operations of this type are suspended until either the process either becomes black 
or is garbage-collected, then proper garbage collector operation can be ensured, and convergence 
guaranteed. Under these conditions, a process creates new cells of its own color, i.e. white processes 
create only white cells. When a white process is encountered by the garbage collector, it must stop 
and allow itself to be colored black before continuing. 

The notion that processes must be marked as well as storage may explain some of the trouble that 
Dijkstra and Lamport had when trying to prove their parallel garbage collection algorithm correct 
[6,7,17,18]. Since their algorithm does not mark a user process by coloring it black (thereby 
prohibiting it from directly touching white nodes), and allows these white processes to run, the 
proof that the algorithm collects only and all garbage is long and very subtle (see [18]). 

3. Coroutines and Generators 

One problem with our "eager beaver" evaluator is that some expressions which have no finite 
values wilt continue to be evaluated without mercy. Consider, for example, the infinite sequence of 
squares of integers 0,1,4,9,... We give below a set of LISP-like functions for computing such a list. 

squares-begining-with s (Xx. (cons (* x x) (squares-begin ing- with (♦ x 1)))) ; Compute an element. 

cons s (Xx y. ; Define CONS function. 

(Xmessage. 

(if (= message 'car) x 
(>= message 'cdr) y))) 

car = (Xx. (x 'car)) ; Ask for first component. 

cdr s (Xx. (x 'cdr)) ; Ask for second component. 

list-of -squares = (squares-beginning-with 0) ; Start the recursion. 

The evaluation of "(squares-beginning-with 0)" will start off a future evaluating "(cons ...)", which 
will start up another future evaluating "(squares-beginning-with 1)", and so forth. Since this 
computation will not terminate, we might worry whether anything useful will ever get done. One 
way to ensure that this computation will not clog the system is to convert it into a "lazy" 
computation [26,23,12,10] by only allowing it to proceed past a point in the infinite list when 
someone forces it to go that far. This can be easily done by performing a lambda abstraction on 
the expression whose evaluation is to be delayed. Since our evaluator will not try to further 
evaluate a X-expression, this will protect its body from evaluation by our eager beavers. 



Baker and Hewitt The Incremental Garbage Collection of Processes 28 December 1977 



squares-begining-with' = 
(Ax. (cons (>:< x x) 

(Amessage. ((squares-begining-with' (♦ x 1)) message)))) ; Protect from early evaluation. 

However, this technique is not really necessary if we use an exponential scheduler for the proportion 
of effort assigned to each process. This scheduler operates recursively by assigning 100?. of the 
system effort to the top-level future, and whenever this future spawns new futures, it allocates only 
507. of its allowed effort to its offspring. While a process is in the waiting room of a future, it 
lends it processing effort to the computation of that future. However, a future which finishes 
returns its effort to helping the system-not its siblings. Now the set of futures can be ordered 
according to who created whom and this ordering forms a tree. As a result of our exponential 
scheduling, the further down in this tree a future is from the top-level future, the lower its priority 
in scheduling. Therefore, as our eager beavers produce more squares, they become exponentially 
more discouraged. But if other processes enter the waiting room for the square of a large number, 
they lend their encouragement to its computation. 

Call-by-future evaluation provides for the maximal concurrency possible in evaluating the 
expressions of a language. It can provide more parallelism than current data flow machines [0,5] or 
"eventual values" [16]. For example consider the following program which computes the square root 
of the sum of the squares of its arguments 

f = (Ax y. (square-root (+ (* x x) (* y y)))) 

Note that in computing the value of an expression such as the following (k 3 (f (h 3) (g 4))) that 
the iquare of (h 3) can be performed in parallel with the square of (g 4). In addition the square 
root of the sum of these values might be performed after the function k has been entered! Thus 
there is a great deal of potential concurrency in the evaluation of the above expression. 

In an evaluator which uses call-by-future for CONS, the obvious program for MAPCAR (the 
LISP analog of APL's parallel application of a function to a vector of arguments) will 
automatically do all of the function applications in parallel in a "pipe-lined" fashion. However, 
due to the scheduler the values earlier in the list will be accorded more effort than the later ones. 

Because this scheduler is not omniscient, system effort will still have to be reallocated by the 
garbage collector as it discovers irrelevant processes and returns their computing power to help 
with still relevant tasks. 

•4. Time and Space 

"Lazy" evaluation [26,12,10] is an optimal strategy [23,3] for evaluating expressions on a single 
processor, in the sense that the minimum number of reductions (procedure calls) are made. 
However, when more than one processor is available to evaluate the expression, it is not clear what 
strategy would be optimal. If nothing is known about the particular expression being evaluated, we 
Conjecture that any reasonable strategy must allocate one processor to lazy evaluation, with the 
other processors performing eager evaluation. We believe that our "eager beaver" evaluator 
implements this policy, and unless the processors interfere with one another excessively, a 
computation must always run faster with an eager evaluator running on multiple processors than a 
lazy evaluator running on a single processor. If there are not enough processors to allocate one for 
every future, then we believe that our "exponential scheduling" policy will do a good job of 
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dynamically allocating processor effort where it is most needed. 

Although the universal creation of futures should reduce the time necessary to evaluate an 
expression, we must consider how the space requirements of this method compare with other 
methods. The space requirements of futures are hard to calculate because under certain schedules, 
future order evaluation approximates call-by-value, while with other schedules, it is equivalent to 
call-by-name (but evaluated only once). In the worst case, the space requirements of futures can be 
arbitrarily bad, depending upon the relative speed of the processors assigned to non-terminating 
futures. 

5. The Power of Futures 

The intuitive semantics associated with a future is that it runs asynchronously with its parent's 
evaluation. This effect can be achieved by either assigning a different processor to each future, or 
by multiplexing all of the futures on a few processors. Given one such implementation, the 
language can easily be extended McCarthy [19] with a construct having the following form: 
"(EITHER <ej> <e2> ... <e n > ) M means evaluate the expressions <e|> in parallel and return the value 
of "the first one that finishes". Ward [24] shows how to give a Scott-type lattice semantics for this 
construct. He starts with a power-set of the base domain and gives it the usual subset lattice 
structure, then extends each primitive function to operate on sets of elements from the base domain 
in the obvious way, and finally defines the result of the EITHER construct to be the least upper 
bound (LUB) of all the <ej> in the subset lattice. The EITHER construct is approximated 1 by 
spawning futures for all the <ej>, and polling them with the parent process until the first one 
finishes. At that point, its answer is returned as the value of the "EITHER" expression, and the 
other futures become inaccessible from the root of the heap. 

We give several examples of the power of the "EITHER" construct: 

(multiply x y) s (EITHER (if x=0 then else (loop)) 

(if y=0 then else (loop)) 
(* x y)) 

(integrate exp bound-variable) = 

(EITHER (fast-heuristic-integrate exp bound-variable) 
(Risch-integrate exp bound-variable)) 

The first example is that of a numeric product routine whose value is zero if either of its 
arguments are zero, even if the non-zero argument is undefined. The second example is an 
integration routine for use in a symbolic manipulation language like Macsyma, where there is a 
relatively fast heuristic integration routine which looks for common special cases, and a general but 
slow decision procedure called the Risch algorithm. Since the values of both methods are 
guaranteed to be the same (assuming that they perform integration properly), we need not worry 
about the possibility of non-determinacy of the value of this expression (i.e. non-singleton subsets 
of the base domain in Ward's lattice model). 



1. This implementation is only an approximation because only singleton sets of elements of the base domain 
can ever be returned. 
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One may ask what the power of such an "EITHER" construct is; i.e. does it increase the expressi ; 
power of the language in which it is embedded? A partial answer to this question has been give* 
with respect to "uninterpreted" schemata. Uninterpreted schemata answer questions about the 
expressive power of programming language constructs which are implicit in the language, rathet 
than being simulated. For example, one can compare the power of recursion versus iteration in a 
context where stacks cannot be simulated. Hewitt and Paterson [13] have shown that uninterpreted 
"parallel" schemata are strictly more powerful than recursive schemata. The essence of this 
difference is that parallel schemata can simulate non-deterministic computation without bogging 
down in some infinite branch by following all branches in parallel. 

Also, Ward [24] has shown that the "EITHER" construct strictly increases the power of the 
X-calculus in the sense that there exist functions over the base domain which are inexpressible 
without "EITHER", but are trivially expressible with it. 

6. Shared Data bases 

The advantage that garbage collection has over the explicit killing of processes becomes apparent 
when parallel processes have access to a shared data base. These data bases are usually protected 
from inconsistency due to simultaneous update by a mutual exclusion method. However, if some 
process were to be killed while it was inside such a data base, the data base would remain locked, 
and hence unresponsive to the other processes requesting access. 

The solution we propose is for the data base to always keep a list of pointers to the processes which 
it has currently inside. In this way, an otherwise irrelevant process will be accessible so long as it is 
inside an accessible data base. However, the moment it emerges, it will be forgotten by the data 
base, and subject to reclamation by garbage collection. The crowds component of a serializer, a 
synchronization construct designed to manage parallel access to a shared data base [14], 
automatically performs such bookkeeping. 

7. Conclusions 

We have presented a method for managing the allocation of processors as well as storage to the 
subcomputations of a computation in a way that tries to minimize the elapsed time required. This 
is done by anticipating which subcomputations will be needed and starting them running in 
parallel, before the results they compute are needed. Because of this anticipation, subcomputations 
may be started whose results are not needed, and thus our method identifies and revokes these 
allocations of storage and processing power through an incremental garbage collection method. 

The scheme presented here assumes that all of the processors reside in a common, global address 
space, like that of HYDRA [25]. Since networks of local address spaces look promising for the 
future, methods for garbage collecting those systems need to be developed. 
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2. However, since Algol-68 does not support "returned functional values", A scheme in the language need 
not use garbage collection to discover irrelevant "cventuals". They can be coerced into values before being 
returned as the value of a procedure, and hence processor allocation can use a LIFO scheme like that used 
for storage of the activation records on the slack. However a certain amount of concurrency can be lost 
by enforcing this coercion. 
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